Abstract

State-of-the-art factor analysis based channel compensation methods
for speaker recognition assume that speaker/utterance dependent
Gaussian Mixture Model (GMM) mean super-vectors can be constrained to
lie in a lower dimensional subspace. This assumption does not consider
the fact that conventional acoustic features may also be constrained
in a similar way in the feature space. In this study, motivated by the
low-rank covariance structure of cepstral features, we propose a factor
analysis model in the acoustic feature space instead of the super-vector
domain and derive a mixture dependent feature transformation. We
demonstrate that the proposed Acoustic Factor Analysis (AFA)
transformation performs feature dimensionality reduction,
de-correlation, variance normalization and enhancement at the same
time. The transform applies a square-root Wiener gain on the acoustic
feature eigenvector directions, and is similar to the signal subspace
based speech enhancement schemes. We also propose several methods of
adaptively selecting the AFA parameter for each mixture. The proposed
feature transform is applied using a probabilistic mixture alignment,
and is integrated with a conventional i-Vector system. Experimental
results on the telephone trials of the NIST SRE 2010 demonstrate the
effectiveness of the proposed scheme.

1.  Introduction 

Factor analysis based channel compensation methods for speaker
recognition are based on the assumption that speaker/utterance
dependent adapted GMM [1] mean super-vectors can be constrained to lie
in a lower dimensional subspace [2-4]. The assumption of lower
dimensional speaker and channel dependent subspaces inspired various
channel compensation schemes such as Eigenvoice [2], Eigenchannel [3]
and Joint Factor Analysis (JFA) [3]. With the introduction of
i-Vectors, which are the latent factors of the so-called “total
variability” space [4], the research trend shifted towards directly
applying compensation techniques on these lower dimensional utterance
level features, enabling the development of fully Bayesian techniques
[5, 6].

While super-vector domain factor analysis techniques and their
derivatives are effective, they do not consider the fact that acoustic
feature vectors can also be constrained in a lower dimensional subspace
of the feature space. This is clear, since a low-rank modeling of the
super-covariance matrix obtained from different utterance GMMs is not
equivalent to a low-rank assumption on the acoustic feature covariance
matrix. The lower dimensional representation of the speech short-time
spectrum is a well established phenomenon, which motivated a family of
speech enhancement methods known as the signal subspace approach [7].
This phenomenon is also found to be valid for popular acoustic
features, such as Mel-frequency Cepstral Coefficients (MFCC) [8], even
though these features are processed by the Discrete Cosine Transform
(DCT) for de-correlation. To illustrate that acoustic feature
covariance matrices have close to zero eigenvalues, and can be assumed
low-rank, we train a 1024 mixture full-covariance GMM Universal
Background Model (UBM) using 60 dimensional MFCC features on a large
development data set¹. For a typical mixture of this UBM, the
covariance matrix is plotted as an intensity image in Fig. 1(a),
revealing that the non-diagonal components are indeed significant.
Fig. 1(b) shows the sorted eigenvalues of three different mixtures,
showing how the energy is compacted in the first few coefficients,
while the later ones are close to zero. Also, it is known that the
first few eigen-directions of the feature covariance matrix are
relatively more speaker dependent [8], further justifying the low-rank
assumption on features for a speaker recognition task.

Inspired by the above-mentioned observations, in this study we propose
an acoustic factor analysis [9] scheme for speaker recognition and
develop a mixture-dependent feature dimensionality reduction transform.
The proposed transformation performs dimensionality reduction,
de-correlation, feature variance normalization, and enhancement, at
once. Also, instead of hard feature alignment to a specific mixture,
applying the transformation, and then retraining the UBM, we use a
probabilistic frame alignment and transform the UBM parameters within
the system. Integrating the proposed method within a standard i-Vector
system provides a significant gain in system performance.

2.  Acoustic  Factor  Analysis 

In this section, we describe the proposed factor analysis model of
acoustic features and discuss its formulation, its mixture-wise
application for dimensionality reduction, and its advantages and
properties.

2.1.  Formulation 

Let $X = \{x_n \mid n = 1 \cdots N\}$ be the collection of all acoustic
feature vectors from the development set. Using a factor analysis
model, a $d \times 1$ feature vector $x \in X$ can be represented by

$$x = W y + \mu + \epsilon. \qquad (1)$$

Here, $W$ is a $d \times q$ low-rank factor loading matrix that
represents $q < d$ bases spanning the subspace with important
variability in the feature space, and $\mu$ is the $d \times 1$ mean
vector of $x$. We denote the latent variable vector
$y \sim \mathcal{N}(0, I)$, which is of dimension $q \times 1$, as the
acoustic factors. We assume the remaining noise component
$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ to be isotropic, and the
model is thus equivalent to Probabilistic Principal Component Analysis
(PPCA) [10].
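
As an illustration of the generative model in (1), the following minimal sketch draws samples from a toy PPCA model and compares the sample covariance with the covariance $W W^T + \sigma^2 I$ implied by the model. The dimensions, the random loading matrix and the noise variance are hypothetical values chosen only for the example; they are not taken from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 6, 2                      # feature and latent dimensions (toy values)
sigma2 = 0.1                     # isotropic noise variance (assumed)
W = rng.standard_normal((d, q))  # hypothetical factor loading matrix
mu = rng.standard_normal(d)      # hypothetical feature mean

def sample_ppca(n):
    """Draw n feature vectors x = W y + mu + eps, one per row."""
    y = rng.standard_normal((n, q))                      # y ~ N(0, I)
    eps = np.sqrt(sigma2) * rng.standard_normal((n, d))  # eps ~ N(0, sigma2 I)
    return y @ W.T + mu + eps

X = sample_ppca(200_000)
model_cov = W @ W.T + sigma2 * np.eye(d)   # covariance implied by the model
sample_cov = np.cov(X, rowvar=False)       # empirical covariance of the samples
max_abs_err = np.max(np.abs(sample_cov - model_cov))
```

With enough samples, `max_abs_err` becomes small relative to the feature variances, which reflects that all correlation between feature coefficients is carried by the $q$ acoustic factors.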

The advantage of this model is that the acoustic factors $y$ explain
the correlation between the feature coefficients of $x$, which we
believe are more speaker dependent [8], while the noise component
$\epsilon$ incorporates the residual variance of the data. It should be
emphasized that even though we denote the term $\epsilon$ as “noise”,
when $x$ represents cepstral features, $\epsilon$ actually represents
convolutional channel distortion. Using a mixture of these models [10],
we have

$$p(x) = \sum_{g} w_g \, p(x \mid g), \quad \text{where} \qquad (2)$$

$$p(x \mid g) = \mathcal{N}(\mu_g, \sigma_g^2 I + W_g W_g^T). \qquad (3)$$

Here, $\mu_g$, $w_g$, $W_g$ and $\sigma_g^2$ denote the mean vector,
mixture weight, factor loadings, and noise variance for the $g$-th AFA
mixture.

¹ Details on feature extraction and UBM data are given in Sec. 4.1.

Figure 1: (a) An intensity image plot showing a typical covariance
matrix of a full-covariance UBM (darker indicates lower values) trained
on 60-dimensional MFCC features (20 static + Δ + ΔΔ). (b) Sorted
eigenvalues of three covariance matrices from the UBM showing different
energy compaction in various mixtures. Energy compaction is the
percentage of top eigenvectors that account for 95% of total energy.
Eigenvalues of a typical, a low and a high compaction mixture are
shown. (c) Distribution of top UBM mixture posterior probability values
for development features, revealing that frame alignment is not always
definitive.

2.2. Mixture dependent transformation

One advantage of using a mixture of PPCA models for acoustic factor
analysis is that its parameters can be conveniently extracted from a
full-covariance GMM-UBM trained using the Expectation-Maximization (EM)
algorithm [10]. The procedure is given below.

2.2.1. Universal Background Model

The UBM model $\lambda_0$, trained on the dataset $X$, is given by

$$p(X \mid \lambda_0) = \sum_{g=1}^{M} w_g \, \mathcal{N}(\mu_g, \Sigma_g) \qquad (4)$$

where $w_g$ are the mixture weights, $M$ is the total number of
mixtures, $\mu_g$ are the mean vectors and $\Sigma_g$ are the full
covariance matrices. Here, $\mu_g$ and $w_g$ are the same as in (2)
and (3).

2.2.2. Noise subspace selection

The AFA parameter $q$ in (1) defines the number of principal axes to
retain, assuming the lower $(d - q)$ directions span the noise-only
subspace [7]. The maximum likelihood estimate of the noise variance
$\sigma_g^2$ for the $g$-th mixture is given by

$$\sigma_g^2 = \frac{1}{d - q} \sum_{i=q+1}^{d} \lambda_{g,i} \qquad (5)$$

where $\lambda_{g,q+1} \cdots \lambda_{g,d}$ are the $d - q$ smallest
eigenvalues of $\Sigma_g$. We note that $q$ can be different for each
mixture, and thus in the later sections we denote it by $q(g)$.

2.2.3. Compute the factor loading matrix

The maximum likelihood estimate of the factor loading matrix $W_g$ of
the $g$-th mixture of the AFA model in (2) is given by [10]

$$W_g = U_{q,g} (\Lambda_{q,g} - \sigma_g^2 I)^{1/2} R_g \qquad (6)$$

where $U_{q,g}$ is a $d \times q$ matrix whose columns are the $q$
leading eigenvectors of $\Sigma_g$, $\Lambda_{q,g}$ is a diagonal
matrix containing the corresponding $q$ eigenvalues, and $R_g$ is a
$q \times q$ arbitrary orthogonal rotation matrix. In this work, we set
$R_g = I$.

2.2.4. The AFA transformation

The posterior mean of the acoustic factors $y_n$ can be used as the
transformed and dimensionality reduced version of $x_n$ for the $g$-th
component of the AFA model. This can be shown to be [10]:

$$E\{y_n \mid x_n, g\} = \langle y_n \mid x_n, g \rangle = A_g^T (x_n - \mu_g) = z_{n,g} \qquad (7)$$

where

$$A_g = W_g M_g^{-T} \quad \text{and} \qquad (8)$$

$$M_g = \sigma_g^2 I + W_g^T W_g. \qquad (9)$$

The matrix $A_g$ is termed the $g$-th AFA transform. Here, the original
feature vectors $x_n$ are replaced by the mixture dependent transformed
vectors $z_{n,g}$. It can be easily shown that $z_{n,g}$ is normally
distributed with zero mean and diagonal covariance matrix
$\Sigma_{z_g} = I - \sigma_g^2 \Lambda_{q,g}^{-1}$, demonstrating that
$A_g$ performs mean normalization and de-correlation. Conventionally, a
feature vector $x_n$ is aligned with the mixture $g$ that yields the
highest posterior probability $p(g \mid x_n, \lambda_0)$, and the
corresponding transformation is applied for dimensionality reduction
[10]. However, as the distribution of
$\max_g p(g \mid x_n, \lambda_0)$ for our development data in
Fig. 1(c) shows, features are aligned with multiple Gaussians in most
cases, giving values of $\max_g p(g \mid x_n, \lambda_0) \approx 0.5$.
Thus, we propose to apply the AFA transform using a probabilistic
alignment and to transform the UBM instead of retraining it. The new
UBM is given by

$$p(z \mid \lambda_0) = \sum_{g=1}^{M} w_g \, \mathcal{N}(0, \Sigma_{z_g}) \qquad (10)$$

2.3. Feature enhancement in AFA

Expansion of the transformation matrix $A_g$ in (8) unveils the
built-in enhancement operation it performs. Substituting the
expressions of $W_g$ and $M_g$ from (6) and (9) in (8), we have:

$$A_g^T = \Lambda_{q,g}^{-1} (\Lambda_{q,g} - \sigma_g^2 I)^{1/2} U_{q,g}^T = \Lambda_{q,g}^{-1/2} G_g U_{q,g}^T \qquad (11)$$

where we utilized the fact that $U_{q,g}^T U_{q,g} = I$ and introduced
a diagonal gain matrix given by:

$$G_g = \Lambda_{q,g}^{-1/2} (\Lambda_{q,g} - \sigma_g^2 I)^{1/2}. \qquad (12)$$

Figure 2: Input SNR [dB] ($\xi$) vs. Wiener gains. Wiener and
square-root Wiener gains are shown with a solid (-) and dashed (- -)
line, respectively.


Keeping aside the term $\Lambda_{q,g}^{-1/2}$ in (11), we observe that
the operation in (7) first computes the inner product of the mean
normalized acoustic feature with the $q$ principal eigenvectors of
$\Sigma_g$, then, for each $i$-th eigenvector direction, applies the
gain function specified by the $i$-th diagonal entry of $G_g$ in (12),
which is actually a square-root Wiener gain function [11]. Defining the
a priori SNR $\xi$ of classic speech enhancement terminology as [7]
$\xi = (\lambda_{g,i} - \sigma_g^2)/\sigma_g^2$, and using this to
express the gain equations, we plot the gain functions against $\xi$ in
Fig. 2. Thus, in addition to de-correlation and dimensionality
reduction, the transform $A_g$ also performs feature enhancement,
assuming a noise variance $\sigma_g^2$. It may be noted that
conventional factor analysis techniques in the super-vector space are
also known to be representable using similar Wiener-like gain
functions, as discussed in [12].
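
The steps in (5), (6), (8), (9), (11) and (12) can be sketched numerically for a single mixture as follows. This is only an illustrative sketch under stated assumptions: the covariance matrix is a randomly generated toy matrix standing in for one UBM mixture, the function name `afa_transform` is hypothetical, and $R_g = I$ is used as in this work.

```python
import numpy as np

def afa_transform(Sigma, q):
    """AFA quantities for one mixture covariance Sigma, with R_g = I."""
    evals, evecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    lam_q, U_q = evals[:q], evecs[:, :q]         # q leading eigenpairs
    sigma2 = evals[q:].mean()                    # noise variance, eq. (5)
    W = U_q * np.sqrt(lam_q - sigma2)            # factor loadings, eq. (6)
    M = sigma2 * np.eye(q) + W.T @ W             # eq. (9)
    A = W @ np.linalg.inv(M).T                   # AFA transform, eq. (8)
    gains = np.sqrt((lam_q - sigma2) / lam_q)    # diagonal of G_g, eq. (12)
    return A, sigma2, gains, lam_q, U_q

# Toy covariance standing in for one UBM mixture (hypothetical data).
rng = np.random.default_rng(1)
d, q = 8, 3
B = rng.standard_normal((d, d))
Sigma = B @ B.T / d + 0.05 * np.eye(d)

A, sigma2, gains, lam_q, U_q = afa_transform(Sigma, q)
A_T_eq11 = (gains / np.sqrt(lam_q))[:, None] * U_q.T  # right-hand side of eq. (11)
Sigma_z = A.T @ Sigma @ A                             # covariance of z_{n,g}
xi = (lam_q - sigma2) / sigma2                        # a priori SNR per direction
```

In this sketch, `A.T` agrees with `A_T_eq11`, `Sigma_z` is diagonal with entries $1 - \sigma_g^2/\lambda_{g,i}$, and `gains` equals $\sqrt{\xi/(1+\xi)}$, the square-root Wiener gain expressed in terms of the a priori SNR.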


2.4.  Feature  variance  normalization  in  AFA 

The term $\Lambda_{q,g}^{-1/2}$ in (11) normalizes the variance of the
acoustic feature stream in the $i$-th eigen-direction, since
$\lambda_{g,i}$ is the expected feature variance along this direction
[13]. The AFA transformation performs this normalization in each
mixture in addition to the enhancement mentioned in the previous
section. This process is interestingly similar to the cepstral variance
normalization frequently performed in the front-end, except for its
domain of operation.


2.5.  Adaptive  AFA  dimension  selection 


Since the distribution of eigenvalues is different in each mixture, a
different value of $q(g)$ may be suitable in each case. This is
illustrated in Fig. 1(b), where the eigenvalues of three different
mixture covariance matrices are shown. Typically we see an energy
compaction of about 70%, that is, the first 70% of the eigenvalues
account for 95% of the total energy. But in other cases, the energy
compaction can be very low or high, requiring the retained AFA
dimension to be low or high, respectively. Motivated by this
observation, we develop a simple method of selecting the AFA dimension
$q(g)$. We first set the energy compaction ratio $E$ to a value between
0.9 and 0.97, then compute the sorted eigenvalues $\lambda_{g,i}$ of
$\Sigma_g$ for each mixture $g$. The optimal dimension retaining a
fraction $E$ of the energy is then calculated as:

$$q_E(g) = \min q \quad \text{s.t.} \quad \frac{\sum_{i=1}^{q} \lambda_{g,i}}{\sum_{i=1}^{d} \lambda_{g,i}} > E. \qquad (13)$$

This method is denoted by “AFA-Var-En”. As an alternate method, we use
the effective rank estimation algorithm in [14], generally used for
noise estimation in matrices, to select the AFA dimension. A threshold
$\delta \in [0, 1]$ is set and the AFA dimension is obtained by:

$$q_\delta(g) = \min q \quad \text{s.t.} \quad \left[ \, \cdots \, \right]^{1/2} > \delta. \qquad (14)$$

We denote this method as “AFA-Var-Rk”. When the AFA dimension is fixed
for all mixtures, we denote the system by “AFA-Fix”.
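
The energy-based selection rule (13) used by “AFA-Var-En” can be sketched as below. The function name `select_dim_energy` and the toy eigenvalue sets are hypothetical and serve only to illustrate the rule; the sketch assumes $0 < E < 1$.

```python
import numpy as np

def select_dim_energy(eigvals, E):
    """Smallest q whose top-q eigenvalues hold more than a fraction E of the energy."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    ratio = np.cumsum(lam) / lam.sum()                     # cumulative energy ratio
    # Index of the first ratio strictly above E, converted to a count.
    q = int(np.searchsorted(ratio, E, side="right")) + 1
    return min(q, lam.size)

# Toy eigenvalue sets for two mixtures (hypothetical values, d = 5).
eig_sets = [[5.0, 3.0, 1.0, 0.5, 0.5], [9.0, 0.4, 0.3, 0.2, 0.1]]
dims = [select_dim_energy(ev, E=0.85) for ev in eig_sets]

# Fraction of dimensions retained over all mixtures.
alpha = sum(dims) / (len(eig_sets) * len(eig_sets[0]))
```

Because the eigenvalue distributions of the two toy mixtures differ, the same value of $E$ yields a different retained dimension for each of them.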


3. AFA integrated i-Vector system

In this section, we describe how the proposed method can be
incorporated into the current state-of-the-art i-Vector system
framework [4]. First, a full covariance UBM model $\lambda_0$, given by
(4), is trained on the development data vectors. Next, the AFA
dimension $q(g)$ for each mixture $g$ is set, and the noise variance
$\sigma_g^2$ is computed using (5). The factor loading matrix $W_g$ and
transformation matrix $A_g$ are then computed using (6) and (8),
respectively. For each development utterance $s$, the zero order
statistics are given by

$$N_s(g) = \sum_{n \in s} \gamma_g(n), \quad \text{where} \quad \gamma_g(n) = p(g \mid x_n, \lambda_0) \qquad (15)$$

following the standard procedure [2, 4]. However, for AFA, the first
order statistics $F_s(g)$ are extracted using the transformed features
in the corresponding mixtures instead of the original features:

$$F_s(g) = \sum_{n \in s} \gamma_g(n) z_{n,g} = A_g^T \sum_{n \in s} \gamma_g(n) (x_n - \mu_g). \qquad (16)$$

For estimating the total variability (TV) matrix, the standard
procedure is followed [4] using the new UBM $\lambda_0$ given in (10)
and the statistics $F_s(g)$ and $N_s(g)$. It should be noted that, in
this case, the super-vector dimension reduces from $Md$ to
$K = \sum_{g=1}^{M} q(g)$, and the TV matrix size becomes $K \times R$.
We define the super-vector compression ratio $\alpha = K/Md$, which
measures the overall AFA compaction.
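
The statistics in (15) and (16) can be sketched as follows for one utterance. The function `afa_statistics`, the random toy data and the random matrices standing in for the AFA transforms are all hypothetical; in a real system the posteriors would come from the full-covariance UBM and the transforms from (8).

```python
import numpy as np

def afa_statistics(X, post, mu, A_list):
    """Zero- and first-order statistics for one utterance.

    X      : (N, d) feature vectors x_n
    post   : (N, M) posteriors gamma_g(n) = p(g | x_n, lambda_0)
    mu     : (M, d) UBM mean vectors
    A_list : list of M arrays; A_list[g] has shape (d, q(g))
    """
    N_s = post.sum(axis=0)                   # eq. (15), one value per mixture
    F_s = []
    for g, A in enumerate(A_list):
        centered = X - mu[g]                 # x_n - mu_g for all frames
        weighted = post[:, g] @ centered     # sum_n gamma_g(n) (x_n - mu_g)
        F_s.append(A.T @ weighted)           # eq. (16), length q(g)
    return N_s, F_s

# Toy utterance with two mixtures retaining 2 and 3 dimensions.
rng = np.random.default_rng(2)
N, d, dims = 5, 4, [2, 3]
X = rng.standard_normal((N, d))
post = rng.random((N, len(dims)))
post /= post.sum(axis=1, keepdims=True)      # posteriors sum to one per frame
mu = rng.standard_normal((len(dims), d))
A_list = [rng.standard_normal((d, q)) for q in dims]

N_s, F_s = afa_statistics(X, post, mu, A_list)
K = sum(F.shape[0] for F in F_s)             # super-vector dimension, sum of q(g)
```

Applying $A_g^T$ once to the posterior-weighted sum, rather than to every frame, uses the second form of (16) and avoids storing the transformed features $z_{n,g}$ for each mixture.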

4.  Experiments  and  Results 

We  perform  our  experiments  on  the  male  trials  of  NIST  SRE  2010 
telephone  train/test  condition  (condition  5,  normal  vocal  effort). 

4.1.  System  Description 

For voice activity detection, a phoneme recognizer [15] combined with
an energy based scheme is used. 60-dimensional feature vectors
(19 MFCC + energy + Δ + ΔΔ) are extracted using a 25 ms window with a
10 ms shift and Gaussianized using a 3-s sliding window. Gender
dependent full and diagonal covariance UBMs with 1024 mixtures are
trained on utterances selected from Switchboard II Phase 2 and 3,
Switchboard Cellular Part 1 and 2, and the NIST 2004, 2005, 2006 SRE
enrollment data. For the TV matrix training, the same dataset is
utilized. 400 dimensional i-Vectors are extracted, whitened and then
length normalized [5]. For session variability compensation and
scoring, we use a Gaussian Probabilistic Linear Discriminant Analysis
(PLDA) scheme with a full-covariance noise model [5].

4.2.  Results  using  AFA  transformation 

We compare several AFA systems with our diagonal and full-covariance
UBM based i-Vector systems, denoted by “baseline diag-cov” and
“baseline full-cov”, respectively. The following AFA systems are built:
“AFA-Fix” with $q(g) = 42$, 48 and 54, “AFA-Var-En” with $E = 0.90$,
0.95 and 0.97, and “AFA-Var-Rk” with $\delta = 0.99$, 0.995 and 0.997.
In this experiment, we fix the PLDA eigenvoice size $N_{ev}$ to 200.
The results are shown in Table 1.

From the results, we observe that the AFA-Var-En system in general
outperforms the basic AFA-Fix, even with a similar compression ratio
$\alpha$. The best EER of 2.0276% is obtained by AFA-Var-En
($E = 0.97$), a 15.37% relative improvement over “Baseline full-cov”.
The AFA-Var-Rk ($\delta = 0.997$) method provides the best DCF_new of
0.3786. These results indicate that the proposed AFA transformation is
able to reduce some nuisance directions in the feature space, producing
i-Vectors with better speaker discriminating ability. We also observe
that AFA i-Vector systems are computationally less expensive than the
baseline system, roughly by a factor of $\alpha$. This is expected,
since the computational complexity of an i-Vector system is
proportional to the super-vector size [16].

Table 1: Comparison between baseline i-Vector and AFA systems with
respect to %EER, DCF_old and DCF_new for N_ev = 200. Percent relative
improvement (%r) and super-vector compression ratio ($\alpha$) are also
shown.

System                             alpha   %EER/%r       DCF_old/%r    DCF_new/%r
Baseline full-cov                  1.00    2.3959        0.1273        0.4534
AFA-Fix      q = 42                0.70    2.36/1.42     0.13/1.57     0.44/3.46
AFA-Fix      q = 48                0.80    2.12/11.51    0.12/4.55     0.45/1.21
AFA-Fix      q = 54                0.90    2.25/5.96     0.12/4.47     0.44/3.90
AFA-Var-En   E = 0.90              0.59    2.46/-2.56    0.13/-0.47    0.43/6.17
AFA-Var-En   E = 0.95              0.72    2.13/11.12    0.11/12.96    0.40/11.55
AFA-Var-En   E = 0.97              0.80    2.03/15.37    0.12/9.42     0.40/11.18
AFA-Var-Rk   delta = 0.99          0.53    2.40/-0.32    0.13/1.17     0.42/8.35
AFA-Var-Rk   delta = 0.995         0.62    2.25/6.19     0.12/5.26     0.42/7.39
AFA-Var-Rk   delta = 0.997         0.69    2.13/10.95    0.11/12.56    0.38/16.49

Table 2: Linear score fusion of baseline and AFA systems.

Individual system performances

       System                N_ev   %EER     DCF_old   DCF_new
(i)    Baseline full-cov     200    2.3959   0.1273    0.4534
(ii)   Baseline diag-cov     200    2.4422   0.1243    0.4609
(iii)  AFA-Var-En (0.97)     200    2.0276   0.1153    0.4027
(iv)   AFA-Fix (48)          150    2.0591   0.1205    0.4600

Fusion system performances

       System                       %EER     DCF_old   DCF_new
1      Fusion of (i) & (iii)        1.9759   0.1080    0.3882
2      Fusion of (i) & (iv)         2.0070   0.1103    0.4083
3      Fusion of (i) - (iii)        1.8077   0.0993    0.3733

Figure 3: DET curves showing individual subsystems: AFA-Var-En
(E = 0.97), Baseline full-cov, and Fusion of (i)-(iii) shown in
Table 2.
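
The percent relative improvement (%r) values quoted above follow from the baseline and system figures in Table 1. The short sketch below reproduces two of them; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, system):
    """Percent relative improvement of `system` over `baseline` (lower is better)."""
    return 100.0 * (baseline - system) / baseline

# EER of AFA-Var-En (E = 0.97) against Baseline full-cov.
eer_r = relative_improvement(2.3959, 2.0276)
# DCF_new of AFA-Var-Rk (delta = 0.997) against Baseline full-cov.
dcf_r = relative_improvement(0.4534, 0.3786)
```

Here `eer_r` evaluates to about 15.37 and `dcf_r` to about 16.5, matching the corresponding %r entries in Table 1 up to the rounding of the quoted figures.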

4.2.1.  Fusion  of  multiple  systems 

We pick four of our systems for fusion: (i) Baseline full-cov, (ii)
Baseline diag-cov, (iii) AFA-Var-En ($E = 0.97$), and (iv) AFA-Fix
($q = 48$). The PLDA $N_{ev}$ parameter was set to 200 for systems
(i)-(iii) and 150 for system (iv). Simple linear fusion was used, with
mean and variance normalization of scores to (0, 1) for calibration.
From the results presented in Table 2, the fusion performance of (i)
and (iii) clearly reveals that the AFA and “baseline full-cov” systems
have complementary information, as the EER and DCF values improve. The
best result is achieved by fusing systems (i), (ii) and (iii), to
obtain EER = 1.8077%, DCF_old = 0.0993 and DCF_new = 0.3733. A
performance comparison of systems (i), (iii) and the fusion is shown in
Fig. 3 using Detection Error Trade-off (DET) curves. Here, again, we
observe the superiority of the proposed AFA-Var-En system compared to
the baseline, especially in the low false alarm region, whereas the
fusion system shows better performance over the full DET curve range.
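
A minimal sketch of linear score fusion with mean and variance normalization is given below. The equal fusion weights, the function names and the toy scores are assumptions made for illustration; the text above does not specify how the fusion weights were chosen.

```python
import numpy as np

def normalize_scores(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

def linear_fusion(score_lists, weights=None):
    """Weighted sum of per-system normalized scores (equal weights by default)."""
    normed = np.stack([normalize_scores(s) for s in score_lists])
    if weights is None:
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.asarray(weights, dtype=float) @ normed

# Toy trial scores from two hypothetical systems on the same five trials.
sys_a = [2.1, -0.3, 1.7, -1.2, 0.4]
sys_b = [10.0, 2.0, 8.5, -1.0, 4.0]
fused = linear_fusion([sys_a, sys_b])
```

Normalizing each system first keeps a system with a larger raw score range, such as `sys_b` here, from dominating the fused score.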

5.  Conclusions 

In this study, we have proposed a factor analysis model for acoustic
features to compensate for transmission channel mismatch in speaker
recognition. Using the model, we have developed a mixture dependent
feature transform that performs dimensionality reduction,
de-correlation, variance normalization and enhancement, at once.
Instead of a separate front-end processing step, the proposed transform
has been integrated within an i-Vector speaker recognition framework
using a probabilistic feature alignment technique. Experimental results
have demonstrated the superiority of the proposed scheme compared to
the baseline i-Vector system.

6.  References 

[1] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using
adapted Gaussian mixture models,” Digital Signal Process., vol. 10,
no. 1-3, pp. 19-41, 2000.

[2] P. Kenny, G. Boulianne, and P. Dumouchel, “Eigenvoice modeling
with sparse training data,” IEEE Trans. Speech and Audio Process.,
vol. 13, no. 3, pp. 345-354, May 2005.

[3] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor
analysis versus Eigenchannels in speaker recognition,” IEEE Trans.
Audio, Speech, and Lang. Process., vol. 15, no. 4, pp. 1435-1447,
May 2007.

[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,
“Front-end factor analysis for speaker verification,” IEEE Trans.
Audio, Speech, and Lang. Process., vol. 19, no. 99, pp. 788-798,
May 2010.

[5] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-Vector
length normalization in speaker recognition systems,” in Proc.
Interspeech, Florence, Italy, Oct. 2011, pp. 249-252.

[6] P. Matejka, O. Glembek, F. Castaldo, M. Alam, O. Plchot, P. Kenny,
L. Burget, and J. Cernocky, “Full-covariance UBM and heavy-tailed
PLDA in i-vector speaker verification,” in Proc. ICASSP, Florence,
Italy, Oct. 2011, pp. 4828-4831.

[7] Y. Ephraim and H. Van Trees, “A signal subspace approach for speech
enhancement,” IEEE Trans. Speech and Audio Process., vol. 3, no. 4,
pp. 251-266, Jul. 1995.

[8] B. Zhou and J. H. L. Hansen, “Rapid discriminative acoustic model
based on Eigenspace mapping for fast speaker adaptation,” IEEE
Trans. Speech and Audio Process., vol. 13, no. 4, pp. 554-564,
July 2005.

[9] T. Hasan and J. H. L. Hansen, “Factor analysis of acoustic features
using a mixture of probabilistic principal component analyzers for
robust speaker verification,” in Proc. Odyssey, Singapore, June 2012.

[10] M. Tipping and C. Bishop, “Mixtures of probabilistic principal
component analyzers,” Neural Computation, vol. 11, no. 2,
pp. 443-482, 1999.

[11] S. Vaseghi, Advanced signal processing and digital noise reduction.
Wiley, 1996.

[12] A. McCree, D. Sturim, and D. Reynolds, “A new perspective on
GMM subspace compensation based on PPCA and Wiener filtering,”
in Proc. Interspeech, Florence, Italy, Oct. 2011, pp. 145-148.

[13] T. Anderson, “Asymptotic theory for principal component analysis,”
The Annals of Mathematical Statistics, vol. 34, no. 1, pp. 122-148,
1963.

[14] J. Cadzow, “SVD representation of unitarily invariant matrices,”
IEEE Trans. Acoustics, Speech and Signal Process., vol. 32, no. 3,
pp. 512-516, Jun. 1984.

[15] P. Schwarz, P. Matejka, and J. Cernocky, “Hierarchical structures
of neural networks for phoneme recognition,” in Proc. ICASSP, vol. 1,
May 2006.

[16] O. Glembek, L. Burget, P. Matejka, M. Karafiat, and P. Kenny,
“Simplification and optimization of i-vector extraction,” in Proc.
ICASSP, Florence, Italy, Oct. 2011, pp. 4516-4519.




